Numerical Analysis Project Presentation
Building Insurance Prediction With Regression Analysis
Regression measures the average relationship between two or more variables in terms of the original units of the data. Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. Linear Regression is used to estimate the value of a continuous dependent variable, whereas Logistic Regression is generally used for classification. Unlike Linear Regression, its dependent variable can take only a limited number of values, i.e. the dependent variable is categorical. When there are only two possible outcomes, it is called Binary Logistic Regression.
- Provides estimates of the values of the dependent variable from values of the independent variables
- Can be extended to two or more predictors, which is known as multiple regression
- Shows the nature of the relationship between two or more variables
In Linear Regression, the output is the weighted sum of the inputs. Logistic Regression is a generalised Linear Regression in the sense that we do not output the weighted sum of inputs directly; instead, we pass it through a function that maps any real value to a value between 0 and 1.
We can see from the figure below that the output of the linear regression is passed through an activation function that maps any real value into the interval (0, 1).
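As a minimal sketch of that mapping, the logistic (sigmoid) function squashes the weighted sum of inputs into (0, 1); the weights, inputs and bias below are hypothetical, chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued weighted sum onto the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights, inputs and bias, for illustration only
w = np.array([0.4, -0.2])
x = np.array([2.0, 1.0])
b = -0.1

z = np.dot(w, x) + b   # linear part: any real number (here 0.5)
p = sigmoid(z)         # squashed into (0, 1): about 0.6225
print(round(p, 4))
```

A probability above a chosen threshold (typically 0.5) is then mapped to the positive class, which is how Binary Logistic Regression turns a regression-style output into a classification.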
Recently, there has been an increase in the number of building collapses in Lagos and other major cities in Nigeria. Olusola Insurance Company offers a building insurance policy that protects buildings against damage caused by fire, vandalism, flood or storm.
We have been appointed as the lead data analysts to build a predictive model that determines whether a building will have an insurance claim during a certain period. We will predict the probability of having at least one claim over the insured period of the building.
The model will be based on the building characteristics. The target variable, Claim, is:
- 1 if the building has at least one claim over the insured period;
- 0 if the building does not have a claim over the insured period.
We use EDA to discover underlying patterns, spot anomalies, frame hypotheses, and check assumptions, with the aim of finding a well-fitting model (if one exists). Let's get started with some graphical visualisations of the data. We first import the necessary libraries and then use some of their tools to visualise the data.
# import libraries
import matplotlib.pyplot as plt # data visualization
import pandas as pd # data importing and manipulation
import numpy as np # data manipulation and cleaning
import seaborn as sns # data visualization
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
%matplotlib inline
sns.set(style="darkgrid")
# Set display options on pandas
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_info_rows', 1000)
# reading the data
df_train = pd.read_csv('data/train_data.csv')
df_test = pd.read_csv('data/test_data.csv')
df_vdefinition = pd.read_csv('data/VariableDescription.csv')
# checking the shape
print(df_train.shape)
print(df_test.shape)
print(df_vdefinition.shape)
test_id = df_test['Customer Id']
# view the variables description
df_vdefinition
(7160, 14)
(3069, 13)
(14, 2)
| | Variable | Description |
|---|---|---|
| 0 | Customer Id | Identification number for the Policy holder |
| 1 | YearOfObservation | year of observation for the insured policy |
| 2 | Insured_Period | duration of insurance policy in Olusola Insurance (e.g. full-year insurance: Insured_Period = 1; 6 months: 0.5) |
| 3 | Residential | is the building a residential building or not |
| 4 | Building_Painted | is the building painted or not (N-Painted, V-Not Painted) |
| 5 | Building_Fenced | is the building fenced or not (N-Fenced, V-Not Fenced) |
| 6 | Garden | building has garden or not (V-has garden; O-no garden) |
| 7 | Settlement | Area where the building is located. (R- rural area; U- urban area) |
| 8 | Building Dimension | Size of the insured building in m2 |
| 9 | Building_Type | The type of building (Type 1, 2, 3, 4) |
| 10 | Date_of_Occupancy | date building was first occupied |
| 11 | NumberOfWindows | number of windows in the building |
| 12 | Geo Code | Geographical Code of the Insured building |
| 13 | Claim | target variable. (0: no claim, 1: at least one claim over insured period). |
# view train data head()
df_train.head()
| | Customer Id | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H14663 | 2013 | 1.0 | 0 | N | V | V | U | 290.0 | 1 | 1960.0 | . | 1053 | 0 |
| 1 | H2037 | 2015 | 1.0 | 0 | V | N | O | R | 490.0 | 1 | 1850.0 | 4 | 1053 | 0 |
| 2 | H3802 | 2014 | 1.0 | 0 | N | V | V | U | 595.0 | 1 | 1960.0 | . | 1053 | 0 |
| 3 | H3834 | 2013 | 1.0 | 0 | V | V | V | U | 2840.0 | 1 | 1960.0 | . | 1053 | 0 |
| 4 | H5053 | 2014 | 1.0 | 0 | V | N | O | R | 680.0 | 1 | 1800.0 | 3 | 1053 | 0 |
# view test data head()
df_test.head()
| | Customer Id | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H11920 | 2013 | 1.000000 | 0 | V | N | O | R | 300.0 | 1 | 1960.0 | 3 | 3310 |
| 1 | H11921 | 2016 | 0.997268 | 0 | V | N | O | R | 300.0 | 1 | 1960.0 | 3 | 3310 |
| 2 | H9805 | 2013 | 0.369863 | 0 | V | V | V | U | 790.0 | 1 | 1960.0 | . | 3310 |
| 3 | H7493 | 2014 | 1.000000 | 0 | V | N | O | R | 1405.0 | 1 | 2004.0 | 3 | 3321 |
| 4 | H7494 | 2016 | 1.000000 | 0 | V | N | O | R | 1405.0 | 1 | 2004.0 | 3 | 3321 |
#Check for null values in train and test data
print("Train data columns with null values are " +str(df_train.isnull().any().sum()))
print("Test data columns with null values are " +str(df_test.isnull().any().sum()))
Train data columns with null values are 4
Test data columns with null values are 4
#Null data check continues i.e. per column
df_train.isnull().sum()
Customer Id             0
YearOfObservation       0
Insured_Period          0
Residential             0
Building_Painted        0
Building_Fenced         0
Garden                  7
Settlement              0
Building Dimension    106
Building_Type           0
Date_of_Occupancy     508
NumberOfWindows         0
Geo_Code              102
Claim                   0
dtype: int64
#Null data check continues for test data i.e. per column
df_test.isnull().sum()
Customer Id            0
YearOfObservation      0
Insured_Period         0
Residential            0
Building_Painted       0
Building_Fenced        0
Garden                 4
Settlement             0
Building Dimension    13
Building_Type          0
Date_of_Occupancy    728
NumberOfWindows        0
Geo_Code              13
dtype: int64
#View the info description of the data
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7160 entries, 0 to 7159
Data columns (total 14 columns):
 #   Column              Dtype
---  ------              -----
 0   Customer Id         object
 1   YearOfObservation   int64
 2   Insured_Period      float64
 3   Residential         int64
 4   Building_Painted    object
 5   Building_Fenced     object
 6   Garden              object
 7   Settlement          object
 8   Building Dimension  float64
 9   Building_Type       int64
 10  Date_of_Occupancy   float64
 11  NumberOfWindows     object
 12  Geo_Code            object
 13  Claim               int64
dtypes: float64(3), int64(4), object(7)
memory usage: 783.2+ KB
# copy our train data
df_train_vs = df_train.copy()
df_train_vs['Claim'].value_counts()
0    5526
1    1634
Name: Claim, dtype: int64
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'No Claim', 'Claim'
sizes = df_train_vs['Claim'].value_counts()
explode = (0, 0.1) # only "explode" the 2nd slice (i.e. 'Claim')
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
plt.figure(figsize =(8,5))
# visualize the counts of each Garden type for each claim type
# V and O indicate "has garden" and "no garden" respectively
sns.countplot(x='Garden', hue='Claim', data=df_train_vs)
plt.show()
plt.figure(figsize =(8,5))
# visualize the counts of each building painted type for each claim type
# N and V indicate "painted" and "not painted" respectively
sns.countplot(x='Building_Painted', hue='Claim', data=df_train_vs)
plt.show()
plt.figure(figsize =(8,5))
# visualize the counts of each settlement type for each claim type
# R and U indicate "rural area" and "urban area" respectively
sns.countplot(x='Settlement', hue='Claim', data=df_train_vs)
plt.show()
plt.figure(figsize =(8,5))
# visualize the counts of each building fenced type for each claim type
# N and V indicate "fenced" and "not fenced" respectively
sns.countplot(x='Building_Fenced', hue='Claim', data=df_train_vs)
plt.show()
plt.figure(figsize =(8,5))
# visualize the counts of each residential type for each claim type
# 1 and 0 indicate "residential" and "not residential" respectively
sns.countplot(x='Residential', hue='Claim', data=df_train_vs)
plt.show()
# visualize the counts of each Year Of Observation for each claim type
sns.countplot(x='YearOfObservation', data=df_train_vs, hue='Claim')
plt.show()
# visualizing data distributions of insured period.
plt.figure(figsize =(10,5))
df_train_vs['Insured_Period'].hist()
plt.show()
# visualizing data distributions of date of occupancy.
plt.figure(figsize =(10,5))
sns.histplot(df_train_vs['Date_of_Occupancy'], bins=25)
plt.show()
# counts for each building types
sns.countplot(x='Building_Type', data=df_train_vs)
plt.show()
# visualize the counts of each building type for each claim type
sns.countplot(x='Building_Type', data=df_train_vs, hue='Claim')
plt.show()
# visualizing data distributions of Building Dimension in m2.
plt.figure(figsize =(10,5))
sns.histplot(df_train_vs['Building Dimension'], bins=20, kde=True)
plt.show()
#heatmap showing feature correlation
correlation_matrix = df_train_vs.corr().round(2)
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
Data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse; in other words, the features can then be easily interpreted by the algorithm. Feature encoding performs transformations on the data so that it can be accepted as input by machine learning algorithms while still retaining its original meaning. Missing values are very common in real datasets, and they must be taken into consideration. Two standard strategies are:
- eliminate rows with missing data, or
- estimate (impute) the missing values.
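As a toy sketch of those two strategies (the small DataFrame below is made up for illustration; the handling of our real columns follows later):

```python
import numpy as np
import pandas as pd

# toy frame with gaps, mimicking the Garden / Building Dimension columns
df = pd.DataFrame({
    "Garden": ["V", None, "O", "V"],
    "Building Dimension": [290.0, 490.0, np.nan, 680.0],
})

# Strategy 1: eliminate rows with missing data
dropped = df.dropna()  # keeps only the 2 fully observed rows

# Strategy 2: estimate the missing values
filled = df.copy()
# mode for a categorical column, mean for a numeric one
filled["Garden"] = filled["Garden"].fillna(filled["Garden"].mode()[0])
filled["Building Dimension"] = filled["Building Dimension"].fillna(
    filled["Building Dimension"].mean())

print(len(dropped), filled.isnull().sum().sum())
```

Dropping rows is simple but discards signal; imputation keeps every row at the cost of some bias, which is the trade-off behind the imputation choices made below.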
# analyze the relationship between claim
# and residential values
pd.crosstab(df_train_vs['Residential'],df_train_vs['Claim'], normalize=True) * 100
| Residential \ Claim | 0 | 1 |
|---|---|---|
| 0 | 54.832402 | 14.622905 |
| 1 | 22.346369 | 8.198324 |
# analyze the relationship between Building fenced
# and settlement values
pd.crosstab(df_train_vs['Settlement'],df_train_vs['Building_Fenced'])
| Settlement \ Building_Fenced | N | V |
|---|---|---|
| R | 3608 | 2 |
| U | 0 | 3550 |
ct = pd.crosstab(df_train_vs['Residential'],df_train_vs['Claim'], normalize=True) * 100
ct.plot.bar(stacked=True)
plt.show()
# analyze the relationship between Building painted
# and settlement values
pd.crosstab(df_train_vs['Settlement'],df_train_vs['Building_Painted'])
| Settlement \ Building_Painted | N | V |
|---|---|---|
| R | 7 | 3603 |
| U | 1771 | 1779 |
df_train.head()
| | Customer Id | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H14663 | 2013 | 1.0 | 0 | N | V | V | U | 290.0 | 1 | 1960.0 | . | 1053 | 0 |
| 1 | H2037 | 2015 | 1.0 | 0 | V | N | O | R | 490.0 | 1 | 1850.0 | 4 | 1053 | 0 |
| 2 | H3802 | 2014 | 1.0 | 0 | N | V | V | U | 595.0 | 1 | 1960.0 | . | 1053 | 0 |
| 3 | H3834 | 2013 | 1.0 | 0 | V | V | V | U | 2840.0 | 1 | 1960.0 | . | 1053 | 0 |
| 4 | H5053 | 2014 | 1.0 | 0 | V | N | O | R | 680.0 | 1 | 1800.0 | 3 | 1053 | 0 |
df_test.head()
| | Customer Id | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H11920 | 2013 | 1.000000 | 0 | V | N | O | R | 300.0 | 1 | 1960.0 | 3 | 3310 |
| 1 | H11921 | 2016 | 0.997268 | 0 | V | N | O | R | 300.0 | 1 | 1960.0 | 3 | 3310 |
| 2 | H9805 | 2013 | 0.369863 | 0 | V | V | V | U | 790.0 | 1 | 1960.0 | . | 3310 |
| 3 | H7493 | 2014 | 1.000000 | 0 | V | N | O | R | 1405.0 | 1 | 2004.0 | 3 | 3321 |
| 4 | H7494 | 2016 | 1.000000 | 0 | V | N | O | R | 1405.0 | 1 | 2004.0 | 3 | 3321 |
df_train['Building_Painted'] = df_train['Building_Painted'].replace({'N':1, 'V':0})
df_train['Building_Fenced'] = df_train['Building_Fenced'].replace({'N':1, 'V':0})
df_train['Garden'] = df_train['Garden'].replace({'V':1, 'O':0})
df_train['Settlement'] = df_train['Settlement'].replace({'U':1, 'R':0})
df_test['Building_Painted'] = df_test['Building_Painted'].replace({'N':1, 'V':0})
df_test['Building_Fenced'] = df_test['Building_Fenced'].replace({'N':1, 'V':0})
df_test['Garden'] = df_test['Garden'].replace({'V':1, 'O':0})
df_test['Settlement'] = df_test['Settlement'].replace({'U':1, 'R':0})
df_train.isnull().sum()
Customer Id             0
YearOfObservation       0
Insured_Period          0
Residential             0
Building_Painted        0
Building_Fenced         0
Garden                  7
Settlement              0
Building Dimension    106
Building_Type           0
Date_of_Occupancy     508
NumberOfWindows         0
Geo_Code              102
Claim                   0
dtype: int64
df_test.isnull().sum()
Customer Id            0
YearOfObservation      0
Insured_Period         0
Residential            0
Building_Painted       0
Building_Fenced        0
Garden                 4
Settlement             0
Building Dimension    13
Building_Type          0
Date_of_Occupancy    728
NumberOfWindows        0
Geo_Code              13
dtype: int64
#Handling missing values for the train data
#Impute values for Garden using the mode
df_train['Garden'] = df_train['Garden'].fillna(df_train['Garden'].mode()[0])
#Impute values for Building Dimension using the mean
df_train['Building Dimension'] = df_train['Building Dimension'].fillna(df_train['Building Dimension'].mean())
#Impute values for Date_of_Occupancy using the mean
df_train['Date_of_Occupancy'] = df_train['Date_of_Occupancy'].fillna(df_train['Date_of_Occupancy'].mean())
#Impute values for Geo_Code using the mode
df_train['Geo_Code'] = df_train['Geo_Code'].fillna(df_train['Geo_Code'].mode()[0])
df_train.isnull().sum()
Customer Id           0
YearOfObservation     0
Insured_Period        0
Residential           0
Building_Painted      0
Building_Fenced       0
Garden                0
Settlement            0
Building Dimension    0
Building_Type         0
Date_of_Occupancy     0
NumberOfWindows       0
Geo_Code              0
Claim                 0
dtype: int64
df_test.isnull().sum()
Customer Id            0
YearOfObservation      0
Insured_Period         0
Residential            0
Building_Painted       0
Building_Fenced        0
Garden                 4
Settlement             0
Building Dimension    13
Building_Type          0
Date_of_Occupancy    728
NumberOfWindows        0
Geo_Code              13
dtype: int64
#Handling missing values for the test data
#(ideally the train-set statistics would be reused here, to avoid train/test inconsistency)
#Impute values for Garden using the mode
df_test['Garden'] = df_test['Garden'].fillna(df_test['Garden'].mode()[0])
#Impute values for Building Dimension using the mean
df_test['Building Dimension'] = df_test['Building Dimension'].fillna(df_test['Building Dimension'].mean())
#Impute values for Date_of_Occupancy using the mean
df_test['Date_of_Occupancy'] = df_test['Date_of_Occupancy'].fillna(df_test['Date_of_Occupancy'].mean())
#Impute values for Geo_Code using the mode
df_test['Geo_Code'] = df_test['Geo_Code'].fillna(df_test['Geo_Code'].mode()[0])
df_train.head()
| | Customer Id | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H14663 | 2013 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 290.0 | 1 | 1960.0 | . | 1053 | 0 |
| 1 | H2037 | 2015 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 490.0 | 1 | 1850.0 | 4 | 1053 | 0 |
| 2 | H3802 | 2014 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 595.0 | 1 | 1960.0 | . | 1053 | 0 |
| 3 | H3834 | 2013 | 1.0 | 0 | 0 | 0 | 1.0 | 1 | 2840.0 | 1 | 1960.0 | . | 1053 | 0 |
| 4 | H5053 | 2014 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 680.0 | 1 | 1800.0 | 3 | 1053 | 0 |
df_train['NumberOfWindows'].value_counts()
.       3551
4        939
3        844
5        639
2        363
6        306
7        211
8        116
1         75
>=10      67
9         49
Name: NumberOfWindows, dtype: int64
df_test['NumberOfWindows'].value_counts()
.       2240
3        227
4        194
5        151
2         70
6         70
7         54
8         26
1         16
>=10      11
9         10
Name: NumberOfWindows, dtype: int64
# coerce NumberOfWindows to numeric: the '.' placeholder and the '>=10'
# bucket become NaN, which we then fill with 0
df_train['NumberOfWindows'] = pd.to_numeric(df_train['NumberOfWindows'], errors='coerce')
df_train['NumberOfWindows'] = df_train['NumberOfWindows'].fillna(0)
df_test['NumberOfWindows'] = pd.to_numeric(df_test['NumberOfWindows'], errors='coerce')
df_test['NumberOfWindows'] = df_test['NumberOfWindows'].fillna(0)
# annot = True to print the values inside the square
plt.figure(figsize =(10,8))
sns.heatmap(data=df_train.corr().round(2), annot=True)
plt.show()
Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in. Here we use two approaches:
- the univariate selection method, and
- a correlation matrix with a heatmap.
df_train.columns
Index(['Customer Id', 'YearOfObservation', 'Insured_Period', 'Residential',
'Building_Painted', 'Building_Fenced', 'Garden', 'Settlement',
'Building Dimension', 'Building_Type', 'Date_of_Occupancy',
'NumberOfWindows', 'Geo_Code', 'Claim'],
dtype='object')
#Drop the Customer Id column from the train data (the test ids were saved earlier as test_id for the submission file)
df_train = df_train.drop(['Customer Id'], axis=1)
df_train.head()
| | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Geo_Code | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 290.0 | 1 | 1960.0 | 0.0 | 1053 | 0 |
| 1 | 2015 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 490.0 | 1 | 1850.0 | 4.0 | 1053 | 0 |
| 2 | 2014 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 595.0 | 1 | 1960.0 | 0.0 | 1053 | 0 |
| 3 | 2013 | 1.0 | 0 | 0 | 0 | 1.0 | 1 | 2840.0 | 1 | 1960.0 | 0.0 | 1053 | 0 |
| 4 | 2014 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 680.0 | 1 | 1800.0 | 3.0 | 1053 | 0 |
#Drop Geo_Code column for train data and test data
df_train = df_train.drop(['Geo_Code'], axis=1)
df_test = df_test.drop(['Geo_Code'], axis=1)
#df_train = df_train.drop(['NumberOfWindows'], axis=1)
#df_test = df_test.drop(['NumberOfWindows'], axis=1)
df_train.head()
| | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 290.0 | 1 | 1960.0 | 0.0 | 0 |
| 1 | 2015 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 490.0 | 1 | 1850.0 | 4.0 | 0 |
| 2 | 2014 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 595.0 | 1 | 1960.0 | 0.0 | 0 |
| 3 | 2013 | 1.0 | 0 | 0 | 0 | 1.0 | 1 | 2840.0 | 1 | 1960.0 | 0.0 | 0 |
| 4 | 2014 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 680.0 | 1 | 1800.0 | 3.0 | 0 |
x = df_train.drop(['Claim'],axis = 1)
y = df_train.Claim
x.head()
| | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 290.0 | 1 | 1960.0 | 0.0 |
| 1 | 2015 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 490.0 | 1 | 1850.0 | 4.0 |
| 2 | 2014 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 595.0 | 1 | 1960.0 | 0.0 |
| 3 | 2013 | 1.0 | 0 | 0 | 0 | 1.0 | 1 | 2840.0 | 1 | 1960.0 | 0.0 |
| 4 | 2014 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 680.0 | 1 | 1800.0 | 3.0 |
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import classification_report
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(x,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(11,'Score').round(2)) #print 11 best features
                 Specs       Score
7   Building Dimension  1693570.88
10     NumberOfWindows      266.14
8        Building_Type       36.46
2          Residential       20.06
5               Garden        9.82
6           Settlement        9.77
4      Building_Fenced        9.48
3     Building_Painted        4.80
1       Insured_Period        3.56
9    Date_of_Occupancy        1.51
0    YearOfObservation        0.00
x = df_train.drop(['Claim','YearOfObservation','Date_of_Occupancy'],axis = 1)
y = df_train.Claim
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4)
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(x_train,y_train)
LogisticRegression(max_iter=1000)
y_pred = logistic_regression.predict(x_test)
# performance measurement for classification problems where the output can be two or more classes
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[1379, 14],
[ 339, 58]], dtype=int64)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.80 0.99 0.89 1393
1 0.81 0.15 0.25 397
accuracy 0.80 1790
macro avg 0.80 0.57 0.57 1790
weighted avg 0.80 0.80 0.74 1790
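To connect the confusion matrix to the classification report, the class-1 precision and recall can be recomputed by hand from the four cells, using the numbers printed above:

```python
import numpy as np

# confusion matrix from the logistic regression above
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[1379,  14],
               [ 339,  58]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # 58 / 72  -> about 0.81
recall_1 = tp / (tp + fn)      # 58 / 397 -> about 0.15
print(round(precision_1, 2), round(recall_1, 2))
```

The high precision but very low recall for class 1 reflects the class imbalance seen in the pie chart: the model rarely predicts a claim, but the few claims it does predict are usually right.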
y_pred_proba = logistic_regression.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="Logistic Regression, auc="+str(auc))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()
import shap
shap.initjs()
# The features here are already numeric and tabular, so LinearExplainer can be
# applied directly to the fitted model and the training matrix.
# Note: older shap versions spell this argument feature_dependence="independent".
explainer = shap.LinearExplainer(logistic_regression, x_train, feature_perturbation="interventional")
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test, feature_names=x.columns)
ind = 0
shap.force_plot(
    explainer.expected_value, shap_values[ind, :], x_test.iloc[ind, :],
    feature_names=x.columns
)
ind = 1
shap.force_plot(
    explainer.expected_value, shap_values[ind, :], x_test.iloc[ind, :],
    feature_names=x.columns
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus
RF_x = df_train.drop(['Claim'],axis = 1)
# one-hot encode the target; RandomForestClassifier then treats this as a multi-output problem
RF_y = pd.get_dummies(y)
RF_x.head()
| | YearOfObservation | Insured_Period | Residential | Building_Painted | Building_Fenced | Garden | Settlement | Building Dimension | Building_Type | Date_of_Occupancy | NumberOfWindows |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 290.0 | 1 | 1960.0 | 0.0 |
| 1 | 2015 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 490.0 | 1 | 1850.0 | 4.0 |
| 2 | 2014 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 595.0 | 1 | 1960.0 | 0.0 |
| 3 | 2013 | 1.0 | 0 | 0 | 0 | 1.0 | 1 | 2840.0 | 1 | 1960.0 | 0.0 |
| 4 | 2014 | 1.0 | 0 | 0 | 1 | 0.0 | 0 | 680.0 | 1 | 1800.0 | 3.0 |
x_train, x_test, y_train, y_test = train_test_split(RF_x, RF_y, random_state=1)
random_forest = RandomForestClassifier(criterion='entropy', oob_score=True, random_state=1,n_estimators=100)
random_forest.fit(x_train, y_train)
RandomForestClassifier(criterion='entropy', oob_score=True, random_state=1)
featureNames = df_train.drop(['Claim'],axis = 1).columns
classNames = 'Claim'
feature_imp = pd.Series(random_forest.feature_importances_, index=featureNames).sort_values(ascending=False)
feature_imp
Building Dimension    0.467197
Date_of_Occupancy     0.170856
YearOfObservation     0.104602
Insured_Period        0.076034
Building_Type         0.065355
NumberOfWindows       0.063161
Residential           0.025859
Building_Painted      0.016845
Garden                0.003573
Settlement            0.003570
Building_Fenced       0.002949
dtype: float64
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
x = df_train.drop(['Claim','Garden','Building_Fenced','Settlement'],axis = 1)
y = df_train.Claim
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4)
random_forest = RandomForestClassifier(criterion='entropy', oob_score=True, random_state=1,n_estimators=100)
random_forest.fit(x_train, y_train)
y_pred=random_forest.predict(x_test)
print("The classification report for the model : \n\n"+ classification_report(y_test, y_pred))
The classification report for the model :
precision recall f1-score support
0 0.82 0.92 0.86 1393
1 0.49 0.27 0.35 397
accuracy 0.78 1790
macro avg 0.65 0.59 0.60 1790
weighted avg 0.74 0.78 0.75 1790
featureNames = df_train.drop(['Claim','Garden','Building_Fenced','Settlement'],axis = 1).columns
classNames = 'Claim'
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus
estimator = random_forest.estimators_[0]
# Export as dot file
dot_data = StringIO()
export_graphviz(estimator, out_file=dot_data,
feature_names = featureNames,
class_names = ['0','1'],
rounded = True, proportion = False,
precision = 2, filled = True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) # Create graph from dot data
graph.write_png('tree.png')
Image(graph.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.453084 to fit
# performance measurement for classification problems where the output can be two or more classes
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[1282, 111],
[ 291, 106]], dtype=int64)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
print("The classification report for the model : \n\n"+ classification_report(y_test, y_pred))
The classification report for the model :
precision recall f1-score support
0 0.82 0.92 0.86 1393
1 0.49 0.27 0.35 397
accuracy 0.78 1790
macro avg 0.65 0.59 0.60 1790
weighted avg 0.74 0.78 0.75 1790
import shap
shap.initjs()
shap_values = shap.TreeExplainer(random_forest).shap_values(x_train)
shap.summary_plot(shap_values[1], x_train, plot_type="bar")
explainer = shap.TreeExplainer(random_forest)
shap_value = explainer.shap_values(x_test.iloc[0])
shap.force_plot(explainer.expected_value[0], shap_value[0], x_test.iloc[0])
shap_values = explainer.shap_values(x_test)
shap.force_plot(explainer.expected_value[0], shap_values[0], x_test)
shap.summary_plot(shap_values[1], x_test)
shap.decision_plot(explainer.expected_value[1], shap_values[1], x_test)
from sklearn.tree import DecisionTreeClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
# Initialize and fit the classifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
modelTree = tree.fit(x_train, y_train)
y_pred = tree.predict(x_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("The classification report for the model : \n\n"+ classification_report(y_test, y_pred))
Accuracy: 0.7810055865921788
The classification report for the model :
precision recall f1-score support
0 0.80 0.96 0.87 1373
1 0.59 0.20 0.30 417
accuracy 0.78 1790
macro avg 0.69 0.58 0.59 1790
weighted avg 0.75 0.78 0.74 1790
# performance measurement for classification problems where the output can be two or more classes
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[1313, 60],
[ 332, 85]], dtype=int64)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
print("The classification report for the model : \n\n"+ classification_report(y_test, y_pred))
The classification report for the model :
precision recall f1-score support
0 0.80 0.96 0.87 1373
1 0.59 0.20 0.30 417
accuracy 0.78 1790
macro avg 0.69 0.58 0.59 1790
weighted avg 0.75 0.78 0.74 1790
featureNames = df_train.drop(['Claim','Garden','Building_Fenced','Settlement'],axis = 1).columns
classNames = 'Claim'
dot_data = StringIO()
export_graphviz(tree, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = featureNames,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('insurance.png')
Image(graph.create_png())
import shap
shap.initjs()
shap_values = shap.TreeExplainer(modelTree).shap_values(x_train)
shap.summary_plot(shap_values[1], x_train, plot_type="bar")
explainer = shap.TreeExplainer(modelTree)
shap_value = explainer.shap_values(x_test.iloc[0])
shap.force_plot(explainer.expected_value[0], shap_value[0], x_test.iloc[0])
shap_values = explainer.shap_values(x_test)
shap.force_plot(explainer.expected_value[0], shap_values[0], x_test)
shap.summary_plot(shap_values[1], x_test)
shap.decision_plot(explainer.expected_value[1], shap_values[1], x_test)
x = df_test.drop(['Customer Id','YearOfObservation','Date_of_Occupancy'],axis = 1)
y_pred = logistic_regression.predict(x)
d = {"Customer Id": test_id, 'Claim': y_pred}
test_predictions = pd.DataFrame(data=d)
test_predictions = test_predictions[["Customer Id", 'Claim']]
test_predictions.head()
| | Customer Id | Claim |
|---|---|---|
| 0 | H11920 | 0 |
| 1 | H11921 | 0 |
| 2 | H9805 | 0 |
| 3 | H7493 | 0 |
| 4 | H7494 | 0 |
test_predictions.to_csv('testsubmission.csv', index=False)
To conclude: you now know what regression is and how to implement it for classification in Python. We also explored the rationale and intuition behind logistic regression, to better explain the mathematical relationships captured by the model we developed in Python. Finally, we discussed the performance of our simple regression analysis compared to other basic machine learning techniques built on the concept of regression.